FPGA accelerator for floating-point matrix multiplication

Authors

  • Zeljko Jovanovic
  • Veljko M. Milutinovic
Abstract

This study treats the architecture and implementation of an FPGA accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm, which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering and simplifies placement and routing on the chip. The authors show that such an architecture is especially well suited for full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources in order to allow for a large number of processing elements (PEs). Each PE uses eight Virtex-6 DSP blocks. Both the adders and the multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area and high clock frequency. Finally, the authors quantify the performance of the accelerator, implemented in a Xilinx Virtex-6 FPGA with 252 PEs running at 403 MHz (achieving 203.1 GFLOPS), by comparing it to the DGEMM function from the MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core 2 Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
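
As a rough illustration of the blocking scheme described above, the sketch below (in C) models the host side of the computation; the block size BS and the accel_block_mul() interface are illustrative assumptions, not the paper's actual API. The accelerator performs the O(BS^3) multiply-add work for each block product and streams the partial result block back as soon as it is ready, while the host performs only the O(BS^2) additions needed to accumulate each returned block, which is consistent with the claim that more than 99% of all arithmetic operations stay on the accelerator. The reported peak rate is likewise consistent with each PE sustaining one multiply and one add per cycle: 252 PEs × 403 MHz × 2 FLOP ≈ 203.1 GFLOPS.

    /* Minimal sketch of host-side accumulation for block matrix
     * multiplication. The block size BS and the accel_block_mul()
     * interface are illustrative assumptions, not the paper's API. */
    #include <stddef.h>

    #define BS 64                          /* hypothetical block size */

    /* Stand-in for the accelerator: p = a * b for one BS x BS block
     * pair (on the real system this work is done by the FPGA PEs). */
    static void accel_block_mul(const double *a, const double *b, double *p)
    {
        for (int r = 0; r < BS; ++r)
            for (int c = 0; c < BS; ++c) {
                double s = 0.0;
                for (int k = 0; k < BS; ++k)
                    s += a[r * BS + k] * b[k * BS + c];
                p[r * BS + c] = s;
            }
    }

    /* C += A * B, where A, B and C are nb x nb grids of BS x BS blocks,
     * each block stored contiguously in row-major order. */
    void blocked_mm_host(int nb, const double *A, const double *Bmat, double *C)
    {
        double partial[BS * BS];           /* one streamed result block */
        for (int i = 0; i < nb; ++i)
            for (int j = 0; j < nb; ++j)
                for (int k = 0; k < nb; ++k) {
                    const double *a = A    + ((size_t)i * nb + k) * BS * BS;
                    const double *b = Bmat + ((size_t)k * nb + j) * BS * BS;
                    double       *c = C    + ((size_t)i * nb + j) * BS * BS;

                    /* O(BS^3) multiply-adds: done by the accelerator. */
                    accel_block_mul(a, b, partial);

                    /* O(BS^2) additions: the host accumulates each
                     * partial block as soon as it arrives, so no output
                     * buffering is needed on the chip. */
                    for (int e = 0; e < BS * BS; ++e)
                        c[e] += partial[e];
                }
    }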

Related articles

FPGA based dataflow accelerator for large matrix multiplication

Real-world numerical applications often require a huge number of calculations to be done in a short time. The best way to speed up these applications is to exploit a huge amount of data parallelism by parallelizing independent calculations. Multi-core processors do not have enough resources to achieve any significant utilization of the available data parallelism. Instead of adding new CPUs, addition ...

Automated Generation of High-Performance Large-Scale Matrix Multiplication Accelerator on FPGA

Matrix multiplication (MM) is a key linear algebra routine which has been widely used in many application areas. In this work we provide a high-performance single-precision dense MM FPGA accelerator, and also an automatic generator to generate the accelerator with high throughput and high resource efficiency based on hardware and MM workload specifications. The accelerator adopts the linear sys...

The Algorithms for FPGA Implementation of Sparse Matrices Multiplication

Compared with dense matrix multiplication, the real performance of sparse matrix multiplication on a CPU is roughly 5–100 times lower when expressed in GFLOPS. For sparse matrices, microprocessors spend most of their time comparing matrix indices rather than performing floating-point multiply and add operations. For 16-bit integer operations, such as index comparisons, the computational power of t...
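
To make the index-comparison overhead concrete, the sketch below (an illustrative assumption, not code from the cited paper) computes the dot product of two sorted sparse vectors, as arises when a compressed row of one matrix is multiplied by a compressed column of the other; most loop iterations perform only integer index comparisons, and a floating-point multiply-add occurs only when the indices match.

    /* Sketch: dot product of two sparse vectors stored as sorted
     * (index, value) pairs, as used when multiplying a compressed row
     * of A by a compressed column of B. Illustrative only. */
    #include <stddef.h>

    double sparse_dot(const int *ia, const double *va, size_t na,
                      const int *ib, const double *vb, size_t nb)
    {
        double sum = 0.0;
        size_t i = 0, j = 0;
        while (i < na && j < nb) {
            if (ia[i] == ib[j]) {        /* indices match: one FP multiply-add */
                sum += va[i] * vb[j];
                ++i;
                ++j;
            } else if (ia[i] < ib[j]) {  /* integer comparison only, no FP work */
                ++i;
            } else {
                ++j;
            }
        }
        return sum;
    }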

Energy Performance of Floating-Point Matrix Multiplication on FPGAs

Floating-point matrix multiplication is a basic kernel in scientific computing. It has been shown that implementations of this kernel on FPGAs can achieve high sustained performance [1]. However, to the best of our knowledge, existing work on FPGA-based floating-point matrix multiplication considers the optimization of latency or area only. In this paper, we analyze the impact of various parame...

FPGA-Based Design and Realization of Fixed and Floating Point Matrix Multipliers: A Review

Matrix multiplication is a computationally intensive and fundamental matrix operation in many algorithms used in scientific computations. It serves as the basic building block for many signal processing, image processing, graphics and robotics applications. To improve the performance of these applications, a high-performance matrix multiplier is required. Traditionally, the matrix multiplication operation is e...

Journal:
  • IET Computers & Digital Techniques

Volume: 6   Issue: 

Pages: -

Publication date: 2012